 low rank approximation


Efficient Low Rank Attention for Long-Context Inference in Large Language Models

Neural Information Processing Systems

As the length of input text increases, the key-value (KV) cache in LLMs imposes prohibitive GPU memory costs and limits long-context inference on resource-constrained devices. Existing approaches, such as KV quantization and pruning, reduce memory usage but suffer from numerical precision loss or suboptimal retention of key-value pairs. This work introduces Low Rank Query and Key attention (LRQK), a two-stage framework that jointly decomposes the full-precision query and key matrices into compact rank-r factors during the prefill stage, and then uses these low-dimensional projections to compute proxy attention scores in O(lr) time at each decode step. By selecting only the top-k tokens and a small fixed set of recent tokens, LRQK employs a mixed GPU-CPU cache with a hit-and-miss mechanism in which only missing full-precision KV pairs are transferred, thereby preserving exact attention outputs while reducing CPU-GPU data movement.
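The decode-step idea in this abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the joint query/key decomposition is replaced here by a plain truncated SVD of the key matrix, the GPU-CPU cache is omitted, and the names `fit_projection`, `select_tokens`, and `sparse_attention` are hypothetical.

```python
import numpy as np

def fit_projection(K, r):
    """Prefill stand-in: top-r right singular vectors of K (l x d) as a d x r basis.
    Assumption: a plain truncated SVD replaces the paper's joint Q/K decomposition."""
    _, _, vt = np.linalg.svd(K, full_matrices=False)
    return vt[:r].T

def select_tokens(q, K_low, P, top_k, n_recent):
    """Pick the top-k tokens by proxy score, plus the n_recent latest tokens."""
    l = K_low.shape[0]
    proxy = K_low @ (q @ P)              # (l,) proxy scores, O(l * r) per step
    n_recent = min(n_recent, l)
    recent = np.arange(l - n_recent, l)
    proxy[recent] = -np.inf              # recent tokens are always kept; skip them here
    k = min(top_k, l - n_recent)
    if k > 0:
        top = np.argpartition(-proxy, k - 1)[:k]
    else:
        top = np.array([], dtype=int)
    return np.sort(np.concatenate([top, recent]))

def sparse_attention(q, K, V, idx):
    """Exact softmax attention restricted to the selected full-precision KV pairs."""
    scores = K[idx] @ q / np.sqrt(K.shape[1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V[idx]

# Usage: cache the projected keys once at prefill, reuse them at every decode step.
rng = np.random.default_rng(0)
K, V = rng.standard_normal((64, 16)), rng.standard_normal((64, 16))
q = rng.standard_normal(16)
P = fit_projection(K, r=4)
idx = select_tokens(q, K @ P, P, top_k=8, n_recent=4)
out = sparse_attention(q, K, V, idx)
```

In this sketch only the rows listed in `idx` need to be available in full precision, which is the property a hit-and-miss cache would exploit: rows already resident on the GPU are hits, and only the remaining rows would have to be fetched.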






Sublinear Time Low-Rank Approximation of Distance Matrices

Neural Information Processing Systems

Such distance matrices are commonly computed in software packages and have applications to learning image manifolds, handwriting recognition, and multi-dimensional unfolding, among other things. In an attempt to reduce their description size, we study low rank approximation of such matrices. Our main result is to show that for any underlying distance metric $d$, it is possible to achieve an additive error low rank approximation in sublinear time. We note that it is provably impossible to achieve such a guarantee in sublinear time for arbitrary matrices $A$, and our proof exploits special properties of distance matrices. We develop a recursive algorithm based on additive projection-cost preserving sampling.
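For orientation, an additive error low rank approximation guarantee is usually stated as follows. This is the standard Frobenius-norm form and is given here as an assumption; the exact statement and parameters in the paper may differ. Here $A_k$ denotes the best rank-$k$ approximation of $A$, $B$ is the rank-$k$ output, and $\epsilon > 0$ is the accuracy parameter:

$$\|A - B\|_F^2 \le \|A - A_k\|_F^2 + \epsilon \|A\|_F^2$$

The error is additive because the slack term scales with $\|A\|_F^2$ rather than with the optimal error $\|A - A_k\|_F^2$, as it would in a relative error guarantee.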



Hardness of Low Rank Approximation of Entrywise Transformed Matrix Products

Neural Information Processing Systems

Some related lower bounds include the work of Backurs et al. [2017] that solving kernel Support Vector Machines (SVM), ridge regression, or Principal Component Analysis (PCA) problems to high accuracy or approximating kernel density estimates up to a constant factor for kernels with